Search CORE

82 research outputs found

CRL at Ntcir2

Author: Isahara Hitoshi
Ma Qing
Murata Masaki
Ozaku Hiromi
Utiyama Masao
Publication venue
Publication date: 01/01/2001
Field of study

We have developed systems of two types for NTCIR2. One is an enhenced version of the system we developed for NTCIR1 and IREX. It submitted retrieval results for JJ and CC tasks. A variety of parameters were tried with the system. It used such characteristics of newspapers as locational information in the CC tasks. The system got good results for both of the tasks. The other system is a portable system which avoids free parameters as much as possible. The system submitted retrieval results for JJ, JE, EE, EJ, and CC tasks. The system automatically determined the number of top documents and the weight of the original query used in automatic-feedback retrieval. It also determined relevant terms quite robustly. For EJ and JE tasks, it used document expansion to augment the initial queries. It achieved good results, except on the CC tasks.Comment: 11 pages. Computation and Language. This paper describes our results of information retrieval in the NTCIR2 contes

arXiv.org e-Print Archive

CiteSeerX

Dynamic Sentence Sampling for Efficient Training of Neural Machine Translation

Author: Sumita Eiichiro
Utiyama Masao
Wang Rui
Publication venue
Publication date: 01/01/2018
Field of study

Traditional Neural machine translation (NMT) involves a fixed training procedure where each sentence is sampled once during each epoch. In reality, some sentences are well-learned during the initial few epochs; however, using this approach, the well-learned sentences would continue to be trained along with those sentences that were not well learned for 10-30 epochs, which results in a wastage of time. Here, we propose an efficient method to dynamically sample the sentences in order to accelerate the NMT training. In this approach, a weight is assigned to each sentence based on the measured difference between the training costs of two iterations. Further, in each epoch, a certain percentage of sentences are dynamically sampled according to their weights. Empirical results based on the NIST Chinese-to-English and the WMT English-to-German tasks depict that the proposed method can significantly accelerate the NMT training and improve the NMT performance.Comment: Revised version of ACL-201

arXiv.org e-Print Archive

Crossref

ScrumSourcing: Challenges of Collaborative Post-editing for Rugby World Cup 2019

Author: Hartley Anthony
He Beibei
Isahara Hitoshi
Sumita Eiichiro
Utiyama Masao
Publication venue: Ediciones Universidad de Salamanca (España)
Publication date: 01/01/2018
Field of study

This paper describes challenges facing the ScrumSourcing project to create a neural machine translation (NMT) service aiding interaction between Japanese- and English-speaking fans during Rugby World Cup 2019 in Japan. This is an example of «domain adaptation». The best training data for adapting NMT is large volumes of translated sentences typical of the domain. In reality, however, such parallel data for rugby does not exist. The problem is compounded by a marked asymmetry between the two languages in conventions for post-match reports; and the almost total absence of in-match commentaries in Japanese. In post-editing the NMT output to incrementally improve quality via retraining, volunteer rugby fans will play a crucial role in determining a new genre in Japanese. To avoid de-motivating the volunteers at the outset we undertake an initial adaptation of the system using terminological data. This paper describes the compilation of this data and its effects on the quality of the systems’ output.Este documento describe los retos a los que se enfrenta el proyecto ScrumSourcing para crear un servicio de traducción automática neuronal (NMT) que ayude a la interacción entre los aficionados de habla japonesa e inglesa durante la Copa Mundial de Rugby de 2019 en Japón. Este es un ejemplo de «adaptación al dominio». Los mejores datos de entrenamiento para adaptar la NMT son grandes volúmenes de oraciones traducidas típicas del dominio. Sin embargo, en la realidad no existen tales datos paralelos para el rugby. El problema se agrava por una marcada asimetría entre las dos lenguas en las convenciones para los informes posteriores al partido y la ausencia casi total de comentarios emitidos en directo durante el partido en japonés. En la post-edición de la producción de la NMT para mejorar de forma incremental la calidad a través del reentrenamiento, los voluntarios aficionados al rugby desempeñarán un papel crucial en la determinación de un nuevo género en japonés. Para evitar desmotivar a los voluntarios desde el principio, emprenderemos una adaptación inicial del sistema utilizando datos terminológicos. Este documento describe la compilación de estos datos y sus efectos en la calidad de la producción de los sistemas

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Gestion del Repositorio Documental de la Universidad de Salamanca

Improving fast align by reordering,”

Author: Chenchen Ding
Eiichiro Sumita
Masao Utiyama
Publication venue
Publication date: 01/01/2015
Field of study

Abstract fast align is a simple, fast, and efficient approach for word alignment based on the IBM model 2. fast align performs well for language pairs with relatively similar word orders; however, it does not perform well for language pairs with drastically different word orders. We propose a segmenting-reversing reordering process to solve this problem by alternately applying fast align and reordering source sentences during training. Experimental results with JapaneseEnglish translation demonstrate that the proposed approach improves the performance of fast align significantly without the loss of efficiency. Experiments using other languages are also reported

CiteSeerX